fix: mypy errors and flaky subscriber/publisher tests #1258

Merged
bjsowa merged 2 commits into RobotWebTools:ros2 from PickNikRobotics:fix/rolling-lyrical-ci on May 22, 2026

Conversation

@JWhitleyWork
Contributor

@JWhitleyWork JWhitleyWork commented May 18, 2026

Summary

Fixes pre-existing CI failures on Rolling and Lyrical, which have been red on ros2 since the move to ubuntu:resolute. There are two root causes:

  • Newer mypy (in resolute's package set) flags 5 latent typing issues that older mypy let slide — one in rosbridge_library/capabilities/subscribe.py, four in rosbridge_server/websocket_handler.py. None are caused by the changes in this branch; they are simply now load-bearing for any PR that touches a Rolling/Lyrical job.
  • DDS discovery on resolute overran the topic-existence test deadlines. The helpers in rosbridge_library/util/ros.py polled the rmw graph cache via get_(publisher|subscriber)_names_and_types_by_node, which requires a discovery round-trip even for self-queries. Different subscriber / publisher tests flaked across different runs.

Dependency

This PR stacks on top of #1255 (the service / action lifecycle race fix). #1255 must land first; without it test_action_capabilities SIGSEGVs and the rest of the Rolling/Lyrical job is moot.

Approach

mypy

Smallest possible patches at each error site — no broader refactor, no library-wide annotation pass.

  • subscribe.py:61 — extend the type: ignore on the simplejson fallback to also cover no-redef. The chained from X import Y as encode_json pattern legitimately rebinds the name; the new mypy version requires the suppression to say so.
  • websocket_handler.py::_log_exception — narrow RosbridgeWebSocket.node_handle with the same assert isinstance(..., Node) pattern already used in open() / on_close() before calling .get_logger().
  • websocket_handler.py::protocol_parameters — annotate the bare ClassVar as ClassVar[dict[str, Any]] to match the value it's actually assigned in rosbridge_websocket.py.
  • websocket_handler.py::open / on_close — pass self.request.remote_ip or "" to ClientManager.add_client / remove_client. Tornado types remote_ip as Any | None; the manager signature is str.
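
As a rough illustration of the four patterns listed above, here is a minimal sketch. It is not the rosbridge code: FakeNode stands in for rclpy's Node, Handler stands in for RosbridgeWebSocket, and the exact mypy error codes in the ignore comments are assumptions.

```python
from typing import Any, ClassVar, Optional

# Chained fallback import: every branch binds the same name, so the
# suppression on the later imports has to name no-redef explicitly.
# (The error codes shown here are assumed, not copied from the PR.)
try:
    from ujson import dumps as encode_json
except ImportError:
    try:
        from simplejson import dumps as encode_json  # type: ignore[import-untyped, no-redef]
    except ImportError:
        from json import dumps as encode_json  # type: ignore[no-redef]


class FakeNode:
    """Hypothetical stand-in for rclpy.node.Node."""

    def __init__(self) -> None:
        self.errors: list = []

    def log_error(self, message: str) -> None:
        self.errors.append(message)


class Handler:
    """Hypothetical stand-in for RosbridgeWebSocket."""

    # Optional class attribute: must be narrowed before use.
    node_handle: ClassVar[Optional[FakeNode]] = None
    # A bare ClassVar gives mypy nothing to check; spell out the value type.
    protocol_parameters: ClassVar[dict[str, Any]] = {}

    def __init__(self, remote_ip: Optional[str]) -> None:
        self.remote_ip = remote_ip

    def log_exception(self, message: str) -> None:
        node = Handler.node_handle
        # Narrow Optional[FakeNode] to FakeNode before calling a method on it.
        assert isinstance(node, FakeNode)
        node.log_error(message)

    def client_ip(self) -> str:
        # Coerce an optional value to str at the boundary of a str-only API.
        return self.remote_ip or ""


payload = encode_json({"op": "publish"})
```

The assert both satisfies the type checker and fails loudly at runtime if the handler is used before a node is attached, which is why it is preferred here over a cast.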

Topic-existence checks

The polling-based assert_topic_(not_)subscribed helpers, and the rmw-graph-cache queries they were built around, were always going to flake. The graph cache is updated by DDS discovery events; even for a node querying its own subscriptions, that's an asynchronous round-trip, and the resolute package set's defaults make the round-trip slower.

The deterministic alternative — and the one this PR adopts — is to read the node's own entity list directly:

  • is_topic_published(node, name) → any(p.topic_name == name for p in node.publishers)
  • is_topic_subscribed(node, name) → any(s.topic_name == name for s in node.subscriptions)

Both node.publishers and node.subscriptions are populated synchronously inside create_publisher / create_subscription and cleared inside destroy_publisher / destroy_subscription, so the result is correct as soon as the calling thread has the entity handle — no DDS involvement.
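
The two helpers described above can be sketched as follows. To keep the example runnable without a ROS installation, the node is a hypothetical stand-in built from SimpleNamespace; the only thing assumed about a real node is that publishers and subscriptions are iterables of entities with a topic_name attribute, as the description states.

```python
from types import SimpleNamespace


def is_topic_published(node, topic_name: str) -> bool:
    """Return True if the node itself holds a publisher on topic_name."""
    return any(pub.topic_name == topic_name for pub in node.publishers)


def is_topic_subscribed(node, topic_name: str) -> bool:
    """Return True if the node itself holds a subscription on topic_name."""
    return any(sub.topic_name == topic_name for sub in node.subscriptions)


# Hypothetical stand-in for a node with one publisher and no subscriptions.
fake_node = SimpleNamespace(
    publishers=[SimpleNamespace(topic_name="/chatter")],
    subscriptions=[],
)

published = is_topic_published(fake_node, "/chatter")
subscribed = is_topic_subscribed(fake_node, "/chatter")
```

Because the check only inspects the node's local entity list, it says nothing about publishers or subscribers owned by other nodes; that is sufficient for these tests, which only ask what the manager under test created on its own node.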

The one remaining asynchrony is that MultiSubscriber.unregister routes destroy_subscription through executor.create_task (added in #1194). After manager.unsubscribe, the destroy task is queued but may not have run yet from the test thread's point of view. wait_for_executor_idle handles this: it submits a no-op task and waits for it to run. SingleThreadedExecutor processes tasks FIFO, so when the no-op completes every task enqueued before it has also completed.
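
The no-op barrier idea can be sketched without rclpy. In the example below, FifoExecutor is a hypothetical single-threaded stand-in that runs tasks in submission order and hands back a concurrent.futures.Future; the real helper works against the rclpy executor, whose task type and waiting mechanism differ.

```python
import queue
import threading
from concurrent.futures import Future


class FifoExecutor:
    """Hypothetical stand-in: one worker thread, tasks run in FIFO order."""

    def __init__(self) -> None:
        self._tasks: queue.Queue = queue.Queue()
        # Daemon thread so the sketch never blocks interpreter exit.
        threading.Thread(target=self._spin, daemon=True).start()

    def create_task(self, callback) -> Future:
        future: Future = Future()
        self._tasks.put((callback, future))
        return future

    def _spin(self) -> None:
        while True:
            callback, future = self._tasks.get()
            try:
                future.set_result(callback())
            except Exception as exc:  # report failures through the future
                future.set_exception(exc)


def wait_for_executor_idle(executor, timeout: float = 5.0) -> None:
    """Block until every task enqueued before this call has run."""
    # The no-op is queued behind all earlier tasks; with FIFO processing,
    # its completion implies theirs.
    executor.create_task(lambda: None).result(timeout=timeout)


executor = FifoExecutor()
subscriptions = ["/chatter"]

# Simulates unregister(): the destroy step is queued, not run inline.
executor.create_task(lambda: subscriptions.remove("/chatter"))
wait_for_executor_idle(executor)
```

The guarantee depends entirely on FIFO, single-threaded task processing; with a multi-threaded executor the no-op could complete before an earlier task finishes.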

Why not launch_testing?

launch_testing is for testing systems that span processes. The tests here exercise the in-process SubscriberManager / PublisherManager API on a single node — moving them under launch_testing would impose a per-test process boundary and a much larger refactor without changing the determinism story. node.publishers / node.subscriptions give the same guarantee with no new infrastructure.

What changed

  • rosbridge_library/util/ros.py — rewrite is_topic_published / is_topic_subscribed against node.publishers / node.subscriptions; add wait_for_executor_idle.
  • Drop the assert_topic_(not_)subscribed polling helpers from the two subscriber test files; replace with direct assertTrue / assertFalse against is_topic_subscribed. Add wait_for_executor_idle(self.executor) after each manager.unsubscribe / multi.unregister before asserting the entity is gone.
  • Remove the time.sleep(0.1) before is_topic_published in test_publisher_manager.test_register_infer_topictype: create_publisher is now reflected immediately. The time.sleep(manager.unregister_timeout + 1.0) waits stay; those wait on threading.Timer, not graph propagation.
  • subscribe.py, websocket_handler.py — the mypy fixes described above.

Local validation

Reproduced on a ros:rolling-perception container with current rclpy:

@bjsowa
Member

bjsowa commented May 18, 2026

> 2. DDS discovery is slower on resolute

I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that.

@JWhitleyWork
Contributor Author

> > 2. DDS discovery is slower on resolute
>
> I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that.

Yeah, agreed. It's churning now on using a more deterministic solution (at my prompting) like launch_testing.

@JWhitleyWork JWhitleyWork force-pushed the fix/rolling-lyrical-ci branch 4 times, most recently from e90ddf5 to 35e5b6d on May 18, 2026 23:52
@JWhitleyWork
Contributor Author

I think it did it...

@JWhitleyWork JWhitleyWork marked this pull request as ready for review May 20, 2026 05:34
@bjsowa
Member

bjsowa commented May 20, 2026

Is the first commit needed to make the tests pass?

@JWhitleyWork
Contributor Author

> Is the first commit needed to make the tests pass?

It is, yes. That commit is what makes up #1255, which still needs to go into humble and jazzy.

@bjsowa
Member

bjsowa commented May 21, 2026

I tried cherry-picking 35e5b6d and b58fb2b and that seems to fix the tests. Not sure why we need to include #1255 in this PR

chore: fix pre-existing Rolling/Lyrical CI failures

Two independent root causes block any PR that touches a Rolling/Lyrical job:

mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent
issues older versions silently accepted):
- subscribe.py: extend the type-ignore on the simplejson fallback to also
  cover no-redef. The chained 'import as encode_json' pattern legitimately
  rebinds the name; newer mypy now requires the suppression to say so.
- websocket_handler.py: narrow Node | None at _log_exception's use site
  with the same assert pattern already in open()/on_close(); annotate the
  bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional
  request.remote_ip to a str at the ClientManager.{add,remove}_client
  boundary (Tornado types it as Any | None).

Flaky topic-existence tests: the previous helpers polled the rmw graph
cache via get_(publisher|subscriber)_names_and_types_by_node, which
requires a DDS-discovery round-trip even for self-queries. On 'resolute'
that round-trip routinely overran the 1s polling deadline, so different
tests flaked across different runs.

Switch to a deterministic implementation:
- is_topic_published / is_topic_subscribed in util/ros.py now read
  node.publishers / node.subscriptions (the local entity list). These are
  populated synchronously inside create_publisher / create_subscription
  and cleared inside destroy_publisher / destroy_subscription, so the
  result is correct as soon as the calling thread has the entity handle.
- Drop the assert_topic_(not_)subscribed polling helpers from the
  subscriber tests; call is_topic_subscribed directly.
- subscribers.py routes destroy_subscription through executor.create_task
  (RobotWebTools#1194), so after manager.unsubscribe / MultiSubscriber.unregister the
  test thread still has to synchronize with the executor before asserting
  the entity is gone. Add wait_for_executor_idle, which enqueues a no-op
  task and waits for it; SingleThreadedExecutor processes tasks FIFO, so
  any destroy task enqueued earlier has completed when it returns.
- Drop the time.sleep(0.1) before is_topic_published in
  test_publisher_manager.test_register_infer_topictype — create_publisher
  is now reflected immediately.

Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those
are waiting for threading.Timer to fire and run destroy_publisher, not
for graph propagation.

test: daemonize worker threads in capability tests

test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical
even though pytest itself completes in ~14s and reports 46/46 passing. The
remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon
worker threads.

Each of the action / service capability tests dispatches the rosbridge worker
via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to
the production protocol.register_operation path). The worker calls
SendGoal.send_goal / call_service.call_service, which busy-wait on a result
that is only delivered when the executor spins the registered callback. When
tearDown calls executor.shutdown() before the worker's result callback has
fired, the busy-wait runs forever. Because the dispatch threads are
non-daemon, the Python interpreter blocks at shutdown trying to join them.

Mark these worker threads daemon=True so process exit is not gated on them.
The threads still complete normally on the happy path; daemon only changes
behaviour at interpreter shutdown, mirroring the pattern already used in
test_stress_service_clients.py.
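
The effect of daemon=True described in this commit message can be sketched in isolation. The worker below is a hypothetical stand-in for the rosbridge dispatch thread: it busy-waits on a result that is never delivered, as happens when the executor is shut down before the result callback fires.

```python
import threading
import time


def busy_wait_for_result(result_ready: threading.Event) -> None:
    """Spin until a result arrives; loops forever if it never does."""
    while not result_ready.is_set():
        time.sleep(0.01)


# Nothing ever sets this event, mimicking an executor that was shut down
# before it could deliver the result.
result_ready = threading.Event()

# daemon=True: the interpreter does not join this thread at shutdown, so
# the stuck busy-wait cannot hold the process open.
worker = threading.Thread(
    target=busy_wait_for_result,
    args=(result_ready,),
    daemon=True,
)
worker.start()
```

With daemon=False the same script would hang at exit, because the interpreter joins every non-daemon thread before shutting down. Daemon threads are stopped abruptly at exit, so this is only appropriate where the thread holds nothing that needs orderly cleanup.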
@JWhitleyWork JWhitleyWork force-pushed the fix/rolling-lyrical-ci branch from b58fb2b to 4a2dc1c on May 21, 2026 15:37
@JWhitleyWork
Contributor Author

> I tried cherry-picking 35e5b6d and b58fb2b and that seems to fix the tests. Not sure why we need to include #1255 in this PR

Fair point. Claude told me it was required and I mistakenly trusted it without verification. I dropped that commit from this PR so hopefully CI will pass.

@bjsowa bjsowa changed the title from "Fix: Pre-existing Rolling/Lyrical CI failures (mypy + flaky subscriber tests)" to "fix: mypy errors and flaky subscriber/publisher tests" on May 22, 2026
@bjsowa bjsowa merged commit efc9f37 into RobotWebTools:ros2 May 22, 2026
4 checks passed
@bjsowa
Member

bjsowa commented May 22, 2026

@Mergifyio backport kilted

@mergify

mergify Bot commented May 22, 2026

backport kilted

✅ Backports have been created

Cherry-pick of efc9f37 has failed:

On branch mergify/bp/kilted/pr-1258
Your branch is up to date with 'origin/kilted'.

You are currently cherry-picking commit efc9f37.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   rosbridge_library/src/rosbridge_library/capabilities/subscribe.py
	modified:   rosbridge_library/src/rosbridge_library/util/ros.py
	modified:   rosbridge_library/test/capabilities/test_qos.py
	modified:   rosbridge_library/test/capabilities/test_service_capabilities.py
	modified:   rosbridge_library/test/internal/publishers/test_publisher_manager.py
	modified:   rosbridge_library/test/internal/subscribers/test_multi_subscriber.py
	modified:   rosbridge_library/test/internal/subscribers/test_subscriber_manager.py
	modified:   rosbridge_server/src/rosbridge_server/websocket_handler.py

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rosbridge_library/test/capabilities/test_action_capabilities.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

bjsowa added a commit that referenced this pull request May 22, 2026
#1261)

* fix: mypy errors and flaky subscriber/publisher tests (#1258)

* chore: fix pre-existing Rolling/Lyrical CI failures

* test: daemonize worker threads in capability tests

(cherry picked from commit efc9f37)

# Conflicts:
#	rosbridge_library/test/capabilities/test_action_capabilities.py

* Fix conflicts

---------

Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
bjsowa added a commit that referenced this pull request May 22, 2026
#1261) (#1262)

* fix: mypy errors and flaky subscriber/publisher tests (#1258)

* chore: fix pre-existing Rolling/Lyrical CI failures

* test: daemonize worker threads in capability tests

(cherry picked from commit efc9f37)

# Conflicts:
#	rosbridge_library/test/capabilities/test_action_capabilities.py

* Fix conflicts

---------



(cherry picked from commit 65034c7)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
bjsowa added a commit that referenced this pull request May 22, 2026
#1263)

* fix: mypy errors and flaky subscriber/publisher tests (backport #1258) (#1261)

* fix: mypy errors and flaky subscriber/publisher tests (#1258)

* chore: fix pre-existing Rolling/Lyrical CI failures

* test: daemonize worker threads in capability tests

(cherry picked from commit efc9f37)

# Conflicts:
#	rosbridge_library/test/capabilities/test_action_capabilities.py

* Fix conflicts

---------

Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
(cherry picked from commit 65034c7)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/subscribe.py
#	rosbridge_library/test/internal/subscribers/test_multi_subscriber.py
#	rosbridge_server/src/rosbridge_server/websocket_handler.py

* Fix conflicts

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>